1.   Introduction

There  are  two  critical  design  issues  that  arise  in  connection  with  the 
implementation  of  the  Ultracomputer.  The  first  is  concerned  with  the  latency  and 
bandwidth  of  the  PE-to-memory  interconnection  network,  which  must  be 
adequate  to  support  the  PEs.  The  second  is  the  avoidance  of  serial  bottlenecks 
due  to  synchronization.  Unfortunately  the  solution  to  the  second,  the  combining 
of  fetch-and-add  operations  in  the  network,  has  an  adverse  effect  on  the  latency 
and  bandwidth.  This  is  particularly  true  when  achievement  of  adequate  network 
latency  requires  technology  which  is  gate-limited. 

We propose here an alternative architecture for combining F&A operations.
The idea is to use two networks to connect the PEs and the memories. The first
is the standard omega network, which handles all loads, all stores, and some
F&As, but without combining. The second is a smaller network which handles
only F&As, and provides high-speed combining. We discuss here the
organization of the second network. The origin of this approach is the
observation that the network requirements for F&A instructions which require
combining are quite different from those for loads, stores, and other F&As.
In particular, if there is a lot of combining, only a relatively small number
of addresses are involved. Furthermore, if there are only as many addresses as
stages in the network, they can be distributed synchronously in the network
rather than randomly.

2.   Overview 

The combining network we propose is a tree structure, with each leaf
interfacing to a PE, and the root interfacing to a second port on all memories.
The timing in the PE-network interface uses a (k+1)-phase clock, where k is of
the order of log(n) for a system with n PEs (but see below for a more detailed
discussion of the value of k). During the first k time slots each PE is provided
with the address of a F&A location; if the PE has a F&A request for this
location, it inserts it into the network. All requests submitted during any time
slot will be combined. The network is synchronous in the sense that all messages
at one level are F&As for the same address, so we refer to it as a Synchronous
Combining Network (SCN).

Responses from the SCN will arrive back at the PE a constant time after the
insertion of the request. This will be the minimum time necessary for
transmission through the SCN, memory access, arithmetic, memory rewrite, and
retransmission.

Each processor keeps a copy of the set of k addresses which are currently
acceptable by the SCN. These are stored in a local address buffer (AB). During
the (k+1)th slot the SCN will broadcast updates to the AB, in response to
'nominations' from the PEs. The objective of the update will be to keep the AB
containing the k optimum F&A addresses.
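
The slot discipline described above can be illustrated with a minimal Python
sketch. It is not taken from the report: the function names, the use of a
Python list for the AB (index 0 standing for AB[1]), and the left-to-right
serialization order used when combining are all assumptions made for
illustration. The only property relied on is the usual fetch-and-add one, that
each requester receives the value the location held just before its own
increment was applied.

```python
def slot_for_request(address_buffer, address):
    """Return the 0-based slot whose AB entry matches `address`, or None.

    `address_buffer` is a hypothetical list of the k addresses currently
    accepted by the SCN; index 0 corresponds to AB[1] in the text.
    """
    for slot, entry in enumerate(address_buffer):
        if entry == address:
            return slot
    return None


def combine_slot(increments, memory_value):
    """Serialize the F&A increments submitted in one time slot.

    Returns (responses, new_memory_value). The left-to-right order is an
    assumption; any fixed serialization would satisfy F&A semantics.
    """
    responses = []
    running = memory_value
    for inc in increments:
        responses.append(running)  # value seen before this increment
        running += inc
    return responses, running
```

For example, a PE holding a request for the second address in its AB would
insert it in the second slot, and three requests with increments 1, 2, and 3
against a location holding 10 would see 10, 11, and 13, leaving 16 in memory.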

2.1.   Updating 

We  propose  here  a  number  of  schemes  for  updating  the  AB.  The  objective 
of  these  schemes  is  to  capture  in  the  AB  the  addresses  of  'hot  spots'  in  the  F&A 
address  space,  i.e.  those  addresses  for  which  we  would  like  to  combine  F&As. 
This  is  done  by  a  form  of  Least  Recently  Often  Used  (LROU)  policy. 

Our approximations will in effect implement a FIFO strategy with insertion
of random new requests. All of them use the idea of permitting the PEs to
'nominate' addresses for inclusion in the F&A address list. This is done as
follows.

In the (k+1)th slot, each PE outputs to the SCN a 'nomination'. A
nomination is either null, indicating that the PE has no interest in putting a F&A
address into the list, or is the address of a F&A location. If there is a non-null
nomination, the SCN chooses one such, and each PE removes the entry in AB[1],
moves the lower entries up one place, and puts the new entry in AB[k]. The total
time for this update operation, including SCN delays, will be less than the time to
execute one F&A, so it can be accomplished before the time for the next (k+1)th
slot.

In Nomination Algorithm 1, each PE with an unsatisfied F&A whose
address is not in the AB nominates its address; other PEs output the null
nomination.

In  Nomination  Algorithm  2,  each  PE  with  an  unsatisfied  F&A  nominates  its 
address,  and  other  PEs  output  the  null  nomination.  Note  that  this  algorithm  may 
put  more  than  one  copy  of  an  address  in  the  AB,  so  it  will  presumably  be  less 
likely  that  a  very  hot  spot  will  not  be  represented;  however,  it  will  probably  not 
do as well in the case of a larger number of moderately hot spots. Algorithm 2
also requires less hardware than Algorithm 1.

In Nomination Algorithm 3, each PE nominates the address of its last F&A,
satisfied or unsatisfied, within the last time quantum. There would seem to be
little to be said for this, except that it provides more historical information, so a
hot spot that has temporarily cooled will have a better chance of being kept.
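
The AB update and the three nomination rules can be sketched in Python as
follows. This is an illustrative sketch only: the function names, the use of
None for the null nomination, and the representation of the AB as a list
(index 0 standing for AB[1], the last index for AB[k]) are assumptions, not
part of the proposal.

```python
def update_ab(ab, selected):
    """FIFO update applied by every PE after the (k+1)th slot.

    Drops AB[1], moves the remaining entries up one place, and puts the
    selected address in AB[k]. A null selection leaves the AB unchanged.
    """
    if selected is None:
        return list(ab)
    return ab[1:] + [selected]


def nominate(algorithm, ab, pending, last_address):
    """Return this PE's nomination (an address, or None for null).

    pending      -- address of the PE's unsatisfied F&A, or None
    last_address -- address of the PE's last F&A in the last time
                    quantum (satisfied or not), or None
    """
    if algorithm == 1:
        # Nominate only an unsatisfied address that is not already in the AB.
        if pending is not None and pending not in ab:
            return pending
        return None
    if algorithm == 2:
        # Nominate any unsatisfied address, so duplicates may enter the AB.
        return pending
    if algorithm == 3:
        # Nominate the most recent F&A address regardless of its status.
        return last_address
    raise ValueError("algorithm must be 1, 2, or 3")
```

Algorithm 2 omits the membership test against the AB that Algorithm 1
performs, which is one way to read the remark that it needs less hardware.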
2.2.  Nomination

Selection of nominations can be done pseudo-randomly, by a number of
algorithms. In each of these algorithms, each node in the tree outputs null to its
parent node if all inputs from child nodes are null, and otherwise some selection
of its non-null inputs. The selection, if any, surviving at the root node is then
broadcast down the tree to all leaves.

In Selection Algorithm 1, the selection is according to a pseudo-random
algorithm local to each node.
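
A minimal Python sketch of Selection Algorithm 1 follows. It assumes a binary
tree (the fan-in of the SCN nodes is not fixed by the text), represents the
null nomination as None, and uses a single shared random.Random instance in
place of the per-node pseudo-random generators; all names are hypothetical.

```python
import random


def select_random(nominations, rng):
    """Reduce the leaf nominations to one selection at the root.

    Each node outputs None if all of its children are None, and otherwise
    a pseudo-random choice among its non-null children.
    """
    level = list(nominations)
    if not level:
        return None
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            # An odd node count leaves the last node with a single child.
            children = [c for c in level[i:i + 2] if c is not None]
            next_level.append(rng.choice(children) if children else None)
        level = next_level
    return level[0]
```

Calling `select_random([None, 7, None, None], random.Random(0))` returns 7,
since only one leaf made a non-null nomination. Note that a purely local
choice of this kind does not give every nominating PE the same chance of
being selected when the non-null nominations are unevenly spread over the
subtrees.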

In Selection Algorithm 2, we simulate token-passing; i.e. the processor whose
nomination is selected is the processor with a non-null nomination which would
be next after the previously selected processor if the processors were to be
connected in a ring. This is done as follows. Each node keeps track of which of
its children had their nominations propagated up the tree in the previous selection
cycle. Such children will be marked as 'winners'. Each new nomination
propagated is marked either null, possible, or probable, with initial nominations
being marked possible. The algorithm is something like the following:

    if all children are null, propagate null;
    if one child is marked probable, propagate it marked probable;
    if two or more children are marked probable,
        propagate the lowest, marked probable;
    if child i was a winner, and there is a non-null child j with j >= i,
        propagate the child with the lowest such j, marked probable;
    otherwise propagate the lowest non-null child, marked possible.

Minor  changes  at  the  root  would  be  necessary  to  reflect  the  fact  that  its  children 
are  regarded  as  cyclic. 
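
The per-node rule of Selection Algorithm 2 can be transcribed into Python
roughly as follows. This is a sketch of a single non-root node only: the
cyclic treatment at the root and the bookkeeping that records the next
cycle's winner are left to the caller, and the tuple representation, the
constant names, and the function name are assumptions made for illustration.

```python
NULL, POSSIBLE, PROBABLE = "null", "possible", "probable"


def node_step(children, winner):
    """Apply the propagation rule at one non-root node.

    children -- list of (mark, address) pairs, one per child, in order
    winner   -- index of the child marked as winner, or None
    Returns ((mark, address), chosen_child_index); the index is None
    when the node propagates null.
    """
    non_null = [i for i, (mark, _) in enumerate(children) if mark != NULL]
    if not non_null:
        return (NULL, None), None

    # One or more probable children: propagate the lowest, marked probable.
    probable = [i for i in non_null if children[i][0] == PROBABLE]
    if probable:
        j = probable[0]
        return (PROBABLE, children[j][1]), j

    # A non-null child at or after the previous winner becomes probable.
    if winner is not None:
        at_or_after = [j for j in non_null if j >= winner]
        if at_or_after:
            j = at_or_after[0]
            return (PROBABLE, children[j][1]), j

    # Otherwise propagate the lowest non-null child, marked possible.
    j = non_null[0]
    return (POSSIBLE, children[j][1]), j
```

For instance, with children `[(POSSIBLE, 10), (POSSIBLE, 20)]` and the second
child recorded as the winner, the node propagates address 20 marked probable;
with no recorded winner it propagates address 10 marked possible.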

In  Selection  Algorithm  3,  each  node  passes  to  its  parent  not  only  the  address, 
but  also  the  number  of  times  it  was  nominated.  The  selection  is  the  address  with 
the  largest  nomination  count,  with  random  choices  as  in  Selection  Algorithms  1  or 
2  above  in  the  case  of  ties. 
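
A Python sketch of Selection Algorithm 3 is given below, again assuming a
binary tree. The text does not say how a node should treat two children that
carry the same address; adding their counts is an assumption made here, as are
the names and the use of a pseudo-random tie-break in the style of Selection
Algorithm 1.

```python
import random


def merge(left, right, rng):
    """Combine two child outputs, each None or an (address, count) pair."""
    if left is None:
        return right
    if right is None:
        return left
    if left[0] == right[0]:
        # Same address from both children: counts are summed (assumption).
        return (left[0], left[1] + right[1])
    if left[1] != right[1]:
        return left if left[1] > right[1] else right
    return rng.choice([left, right])  # tie broken pseudo-randomly


def select_by_count(nominations, rng):
    """Reduce leaf nominations to the root's (address, count), or None."""
    level = [None if a is None else (a, 1) for a in nominations]
    if not level:
        return None
    while len(level) > 1:
        level = [
            merge(level[i], level[i + 1] if i + 1 < len(level) else None, rng)
            for i in range(0, len(level), 2)
        ]
    return level[0]
```

Because each node forwards only one address, the count reaching the root is
the count accumulated along the surviving path, so the result approximates
rather than guarantees the globally most-nominated address.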

2.3.   F&A 

A PE which fails to find its F&A address in the AB will nominate it for
inclusion, and then has a choice of whether to wait until the nomination is
successful. First, it could wait to see whether the nomination was successful.
If the strategy were to renominate after an unsuccessful attempt,
starvation might be a serious problem. In any case, even if the PE waits for the
result of only one nomination attempt, this might be slower for F&As for which
combining is not critical, such as those in programs which use F&A just for
safety. Note however that the SCN could be considerably faster than the standard
non-combining network.

An alternative is for the PE to nominate the address and immediately issue
the F&A to the standard non-combining network. The danger of this alternative
is that a number of highly synchronized tasks could all do this at the
same time, giving bad serialization problems. This, however, might turn out to
be very rare in practice.

This  suggests  that  it  might  be  worth  considering  the  value  of  two  types  of 
F&A  instruction,  one  which  might  need  to  be  combined,  and  one  which  would 
not.  Note  that  if  this  were  done,  it  might  not  be  unreasonable  to  separate  the 
combining  F&A  memory  from  the  main  memory.  This  would  further  reduce  the 
F&A  latency  time  and  perhaps  increase  the  value  of  k. 

2.4.   Choice 

Above, the value of k was stated to be of the order of log(n). The reasoning
behind this is that the latency of a F&A in a standard omega network is of the
form A + 2*B*log(n), where B is the delay of a single (2-by-2) switch, and A is
the delay of the memory and arithmetic. We might expect A to be comparable
with 2*B in a system in which the network time does not slow memory accesses,
so the latency of a F&A request is of the order of 4*log(n) network clock times.
Thus a value of k of log(n) would add at most 50% to the latency of a F&A
request, and at most 25% on average. There are also a number of practical
issues which affect k. In the SCN described here, no addresses are transmitted
through the network, so the bandwidth required is halved. In practice this might
mean that the number of packets is halved also, giving a speed-up of a factor of
3/2. Furthermore, since the SCN is a tree, it has a more compact representation
(actually planar), giving additional speed-ups.

The  combination  of  these  factors  would  suggest  that  with  a  value  of  k  = 
3*log(n),  a  latency  as  good  as  the  standard  network  might  be  achieved. 

3.  Theoretical 

The  main  theoretical  question  here  seems  to  be  whether  log(n)  F&A 
addresses  are  sufficient  to  handle  the  hot  spots  for  n  PEs,  even  if  optimally 
chosen.  This  seems  to  be  difficult  to  expect  in  general  --  one  would  certainly 
expect  that  doubling  the  size  of  a  system  would  generate  more  than  one  extra  hot 
spot. However, if most of the hot spots were associated with global resources,
such  as  those  of  the  operating  system,  this  might  well  be  the  case. 

4.  Comments 

It appears that determining the effectiveness of the schemes outlined here
will require more data than we have at the moment, and also a considerable
amount of experimentation. Nomination Algorithm 2 and Selection Algorithm 3
would appear to be superior to the others, but we have no quantitative data.

To compare the various schemes, we would need considerably more data on
F&A statistics. Ideally, we would like a record of F&A addresses with timings
for a very large number of processors running real applications programs and the
operating system. Data without the OS would be of less value, since the OS is
expected to be one of the main sources of F&A hot spots. Similarly, data for a
small number of processes might be useful, since we might be able to infer
single-process F&A statistics; however, it is not clear how dependent such
statistics are on the number of processors.